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Abstract 

In this paper, we consider the problem of column 
subset selection. We present a novel analysis of 
the spectral norm reconstruction for a simple ran¬ 
domized algorithm and establish a new bound 
that depends explicitly on the sampling probabil¬ 
ities. The sampling dependent error bound (i) al¬ 
lows us to better understand the tradeoff in the 
reconstruction error due to sampling probabili¬ 
ties, (ii) exhibits more insights than existing er¬ 
ror bounds that exploit specific probability distri¬ 
butions, and (iii) implies better sampling distri¬ 
butions. In particular, we show that a sampling 
distribution with probabilities proportional to the 
square root of the statistical leverage scores is al¬ 
ways better than uniform sampling and is better 
than leverage-based sampling when the statisti¬ 
cal leverage scores are very nonuniform. And by 
solving a constrained optimization problem re¬ 
lated to the error bound with an efficient bisec¬ 
tion search we are able to achieve better perfor¬ 
mance than using either the leverage-based dis¬ 
tribution or that proportional to the square root 
of the statistical leverage scores. Numerical sim¬ 
ulations demonstrate the benefits of the new sam¬ 
pling distributions for low-rank matrix approxi¬ 
mation and least square approximation compared 
to state-of-the art algorithms. 


1. Introduction 

Give a data matrix A G column subset selection 

(CSS) is an important technique for constructing a com¬ 
pressed representation and a low rank approximation of 
A by selecting a small number of columns. Compared 
with conventional singular value decomposition (SVD), 
CSS could yield more interpretable output while main¬ 
taining performance as close as SVD (Mahoney, 2011). 
Recently, CSS has been applied successfully to prob¬ 
lems of interest to geneticists such as genotype recon¬ 
struction, identifying substructure in heterogeneous popu¬ 
lations, etc. (Paschouet al., 2007b;a; Drineas et al., 2010; 
laved et al., 2011). 

Let C € be the matrix formed by £ selected columns 

of A. The key question to CSS is how to select the columns 
to minimize the reconstruction error: 

l|A-Perils, 

where Fc = C'C'^ denotes the projection onto the column 
space of C with being the pseudo inverse of C and ^ = 
2 or P denotes the spectral norm or the Erobenius norm. 
In this paper, we are particularly interested in the spectral 
norm reconstruction with respect to a target rank k. 

Our analysis is based on a randomized algorithm that se¬ 
lects £ > k columns from A according to sampling prob¬ 
abilities s = (si,..., s„). Building on advanced matrix 
concentration inequalities (e.g., matrix Chernoff bound and 
Bernstein inequality), we develop a novel analysis of the 
spectral norm reconstruction and establish a sampling de¬ 
pendent relative spectral error bound with a high probabil- 
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ity as following: 

11^ ~ < (1 + e(s))||^ — ^fc||2, 

where Ak is the best rank-fc approximation of A based 
on SVD and e(s) is a quantity dependent on the sampling 
probabilities s besides the scalars n,k,£. As revealed in 
our main theorem (Theorem 1), the quantity e(s) also de¬ 
pends on the statistical leverage scores (SLS) inherent to 
the data, based on which are several important randomized 
algorithms for CSS. 

To the best of our knowledge, this is the first such kind of 
error bound for CSS. Compared with existing error bounds, 
the sampling dependent error bound brings us several ben¬ 
efits: (i) it allows us to better understand the tradeoff in the 
spectral error of reconstruction due to sampling probabili¬ 
ties, complementary to a recent result on the tradeoff from a 
statistical perspective (Ma et al., 2014) for least square re¬ 
gression; (ii) it implies that a distribution with sampling 
probabilities proportional to the square root of the SLS 
is always better than the uniform sampling, and is poten¬ 
tially better than that proportional to the SLS when they 
are skewed; (iii) it motivates an optimization approach by 
solving a constrained optimization problem related to the 
error bound to attain better performance. In addition to 
the theoretical analysis, we also develop an efficient bisec¬ 
tion search algorithm to solve the constrained optimization 
problem for finding better sampling probabilities. 

By combining our analysis with recent developments 
for spectral norm reconstruction of CSS (Boutsidis et al., 
2011), we also establish the same error bound for an exact 
rank-fc approximation, i.e., 

11^ ~ < (1 + e(s))||A — Afc||2, 

where 11^ (A) is the best approximation to A within the 
column space of C that has rank at most k. 

The remainder of the paper is organized as follows. We 
review some closely related work in Section 2, and present 
the main result in Section 4 with some preliminaries in Sec¬ 
tion 3. We conduct some empirical studies in Section 5 and 
present the detailed analysis in Section 6. Finally, conclu¬ 
sion is made. 

2. Related Work 

In this section, we review some previous work on CSS, 
low-rank matrix approximation, and other closely related 
work on randomized algorithms for matrices. We focus our 
discussion on the spectral norm reconstruction. 

Depending on whether the columns are selected deter¬ 
ministically or randomly, the algorithms for CSS can 
be categorized into deterministic algorithms and random¬ 
ized algorithms. Deterministic algorithms select i > k 
columns with some deterministic selection criteria. Rep¬ 


resentative algorithms in this category are rank revealing 
QR factorization and its variants from the filed of nu¬ 
merical linear algebra (Gu & Eisenstat, 1996; Pan, 2000; 
Pan & Tang, 1999). A recent work (Boutsidis et al., 2011) 
based on the dual set spectral sparsification also falls 
into this category which will be discussed shortly. Ran¬ 
domized algorithms usually define sampling probabilities 
s G R” and then select £ > k columns based on these 
sampling probabilities. Representative sampling proba¬ 
bilities include ones that depend the squared Euclidean 
norm of columns (better for Erobenius norm reconstruc¬ 
tion) (Erieze et al., 2004), the squared volume of simplices 
defined by the selected subsets of columns (known as 
volume sampling) (Deshpande & Rademacher, 2010), and 
the SLS (known as leverage-based sampling or subspace 
sampling) (Drineas et al., 2008; Boutsidis et al., 2009). 

Depending on whether £> k is allowed, the error bounds 
for CSS are different. Below, we review several representa¬ 
tive error bounds. If exactly k columns are selected to form 
C, the best bound was achieved by the rank revealing QR 
factorization (Gu & Eisenstat, 1996) with the error bound 
given by: 

||A - Pc Ah < \/l + - mA - (D 

with a running time 0{mnk\ogn). The same er¬ 
ror bound was also achieved by using volume sam¬ 
pling (Deshpande & Rademacher, 2010). The running time 
of volume sampling based algorithms can be made close 
to linear to the size of the target matrix. Boutsidis et al. 
(2009) proposed a two-stage algorithm for selecting ex¬ 
actly k columns and provided error bounds for both the 
spectral norm and the Erobenius norm, where in the firs 
stage 0(fc log k) columns are sampled based on a distribu¬ 
tion related to the SLS and more if for the spectral norm 
reconstruction and in the second stage k columns are se¬ 
lected based on the rank revealing QR factorization. The 
spectral error bound in this work that holds with a constant 
probability 0.8 is following: 

\\A-PcA\\2 < (2) 

0 (fc logi/2 k + \oil^{k^ IIA - Afc II 2 

The time complexity of their algorithm (for the spec¬ 
tral norm reconstruction) is given by ,rri^n)) 

since it requires SVD of the target matrix for computing the 
sampling probabilities. 

If more than k columns are allowed to be selected, i.e., 
£ > k, better error bounds can be achieved. In the most 
recent work by Boutsidis et al. (2011), nearly optimal error 
bounds were shown by selecting £ > k columns with a de¬ 
terministic selection criterion based on the dual set spectral 
sparsification. In particular, a deterministic polynomial¬ 
time algorithm ^ was proposed that achieves the following 

*A slower deterministic algorithm with a time complexity 
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error bound; 

II^-^cA|| 2< P-^fc||2 (3) 

in + 0{nik‘^) time where Ty^, is the time needed to 
compute the top k right singular vectors of A and 0{nik‘^) 
is the time needed to compute the selection scores. This 

bound is close to the lower bound ) ,a > 0 

established in their work. It is worth mentioning that the 
selection scores in (Boutsidis et al., 2011) computed based 
on the dual set spectral sparsification is difficult to under¬ 
stand than the SLS. 

Although our sampling dependent error bound is not di¬ 
rectly comparable to these results, our analysis exhibits 
that the derived error bound could be much better than 
that in (2). When the SLS are nonuniform, our new sam¬ 
pling distributions could lead to a better result than (3). 
Most importantly, the sampling probabilities in our algo¬ 
rithm are only related to the SLS and that can be com¬ 
puted more efficiently (e.g., exactly in O(Ty^) or approx¬ 
imately in 0(mnlogn) (Drineas et al., 2012)). In simula¬ 
tions, we observe that the new sampling distributions could 
yield even better spectral norm reconstruction than the de¬ 
terministic selection criterion in (Boutsidis et al., 2011), es¬ 
pecially when the SLS are nonuniform. 

For low rank matrix approximation, several other random¬ 
ized algorithms have been recently developed. For ex¬ 
ample, Halkoetal. (2011) used a random Gaussian ma¬ 
trix fl G or a subsampled random Fourier trans¬ 

form to construct a matrix 17 and then project A into 
the column space of F = AH, and they established nu¬ 
merous spectral error bounds. Among them is a com¬ 
parable error bound 0{^Jn/(.)\\A — Au \\2 to (3) using 
the subsampled random Fourier transform. Other ran¬ 
domized algorithm for low rank approximation include 
CUR decomposition (Drineas et al., 2008; Wang & Zhang, 
2012; 2013) and the Nystrom based approximation for PSD 
matrices (Drineas & Mahoney, 2005; Gittens & Mahoney, 
2013). 


of the solution to the approximated least square with uni¬ 
form sampling and leverage-based sampling. They found 
that leveraging based estimator could suffer from a large 
variance when the SLS are very nonuniform while uniform 
sampling is less vulnerable to very small SLS. This tradeoff 
is complementary to our observation. However, our obser¬ 
vation follows directly from the spectral norm error bound. 
Moreover, our analysis reveals that the sampling distribu¬ 
tion with probabilities proportional to the square root of 
the SLS is always better than uniform sampling, suggest¬ 
ing that intermediate sampling probabilities between SLS 
and their square roots by solving a constrained optimiza¬ 
tion problem could yield better performance than the mix¬ 
ing strategy that linearly combines the SLS and uniform 
probabilities as suggested in (Ma et al., 2014). 

There are much more work on studying the Frobenius 
norm reconstruction of CSS (Drineas et al., 2006a; 
Guruswami & Sinop, 2012; Boutsidis et al., 2011; 
Drineas et al., 2008; Boutsidis et al., 2009). For more 
references, we refer the reader to the survey (Mahoney, 
2011). It remains an interesting question to establish 
sampling dependent error bounds for other randomized 
matrix algorithms. 

3. Preliminaries 


Let A G be a matrix of size mx n and has a rank of 

p < min(m, n). Let fc < p be a target rank to approximate 
A. We write the SVD decomposition of A as 

El 0 

0 E 2 


A = U 




0—k) X (p—k) 


Ui G 


pn X k 


and 


where Si G S 2 G 

V 2 G We use cri,cr 2 ,... to denote the sin¬ 

gular values of A in the descending order, and Amax(Ar) 
and Amin(Ai) to denote the maximum and minimum eigen¬ 
values of a PSD matrix X. For any orthogonal matrix 


U G 


pnx^ 


let G 


hn X {n—l 


denote an orthogonal ma¬ 


trix whose columns are an orthonormal basis spanning the 
subspace of R" that is orthogonal to the column space of 
U. 


Besides low rank matrix approximation and column se¬ 
lection, CSS has also been successfully applied to least 
square approximation, leading to faster and interpretable 
algorithms for over-constrained least square regression. In 
particular, if let 17 G R^^™ denote a scaled sampling ma¬ 
trix corresponding to selecting i < m rows from A, the 
least square problem minxeR" || Ax — b||| can be approx¬ 
imately solved by minxeR" ||17Ax — 17b||| (Drineas et al., 
2008; 2006b; 2011). At ICML 2014, Ma et al. (2014) stud¬ 
ied CSS for least square approximation from a statistical 
perspective. They showed the expectation and variance 

2svd + 0{£n{k^ + {p — k)^)) was also presented with an error 
bound 0{^yp/£)\\A — Afc|| 2 , where p is the rank of A. 


Let s = (si,...,s„) be a set of scores such that 

Sr=i Si = k one for each column of A. We will 
drawn £ independent samples with replacement from the 
set [n] = n} using a multinomial distribution 

where the probability of choosing the ith column is pi = 
Si / Sj ■ Let ,..., be the indices of £> k selected 
columns and S G R"^^ be the corresponding sampling 

^For the sake of discussion, we are not restricting the sum of 
these scores to be one but to be k, which does not affect our con¬ 
clusions. 

^Note that some of the selected columns could be duplicate. 
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matrix, i.e, 


J 1, if i = ij 

\ 0, otherwise, 


and D £ be a diagonal rescaling matrix with Djj = 



Given S, we construct the C matrix as 


C = AS = {Ai^, ■ ■ ■, Ai^). (4) 

Our interest is to bound the spectral norm error between A 
and Pc A for a given sampling matrix S, i.e., || A — 
where Pc A projects A onto the column space of C. For 
the benefit of presentation, we define ft = SD G to 

denote the sampling-and-rescaling matrix, and 

Y = Afl, n, O 2 = ¥ 2 ^ n, (5) 

where G R^^^ and O 2 £ r(p-^)^^. Since the column 
space of Y is the same to that of C, therefore 
\\A-PcA\\2 = \\A-PyA\\2 
and we will bound \\A — Py^|l 2 in our analysis. Let 
= (vi,..., v„) £ and = (ui,. ■ ■, u„) £ 

]g(p-fe)xn^ It is easy to verify that 

= (vii,...,v,jD, O 2 = 

Finally, we let s* = (s*,..., s*) denote the SLS of A 
relative to the best rank-A: approximation to A (Mahoney, 
2011), i.e., s* = llvilll. It is not difficult to show that 
Eli< = k. 


4. Main Result 


Before presenting our main result, we first characterize 
scores in s by two quantities as follows: 

s* \/W 

c(s) = max —, qfs) = max —- (6) 

l<i<n Si l<i<n Si 

Both quantities compare s to the SLS s*. With c(s) and 
g(s), we are ready to present our main theorem regarding 
the spectral error bound. 


Theorem 1. Let A £ R"*^" has rank p and C £ R"*^^ 
contain the selected columns according to sampling scores 
in s. With a probability 1 — (5 — 2fc exp(—£/[8/cc(s)]), we 
have 


11^ ~ < CTfc+i(l + e(s)) 


where e(s) is 
e(s) = 3 


fc(p + 1 - fc)log [f 
c(s)- -0 -^ 


9(s) 


fclog [f] 


where CTfc+i = ||A — Ak \\2 is the [k + l)f/r singular value 
of A. 


Remark: Clearly, the spectral error bound and the success¬ 
ful probability in Theorem 1 depend on the quantities c(s) 
and q{s). In the subsection below, we study the two quan¬ 
tities to facilitate the understanding of the result in Theo¬ 
rem 1. 


4.1. More about the two quantities and their tradeoffs 


The result in Theorem 1 implies that the smaller the quan¬ 
tities c(s) and q{s), the better the error bound. Therefore, 
we first study when c(s) and q{s) achieve their minimum 
values. The key results are presented in the following two 
lemmas with their proofs deferred to the supplement. 


Lemma 1. The set of scores in s that minimize q{s) is given 


by Si oc i.e., Si = 




Remark: The sampling distribution with probabilities that 
are proportional to the square root of s*, i £ [n] falls in be¬ 
tween the uniform sampling and the leverage-based sam¬ 
pling. 

Lemma 2. c(s) > 1, Vs such that EEi 

scores in s that minimize c(s) is given by Si = s*, and the 

minimum value of c(s) is 1. 


Next, we discuss three special samplings with s (i) propor¬ 
tional to the square root of the SLS, i.e., s^ oc i/sf (re¬ 
ferred to as square-root leverage-based sampling or sqL- 
sampling for short), (ii) equal to the SLS, i.e., Si = s* 
(referred to as leverage-based sampling or L-sampling for 
short), and (iii) equal to uniform scalars Si = k/n (referred 
to as uniform sampling or U-sampling for short). Firstly, if 
Si oc ’ 9(®) achieves its minimum value and we have 
the two quantities written as 


“ z, X! 


2=1 


CsqL = max 


s: E, \/? 


= qsgL ] 


(7) 


In this case, when s* is flat (all SLS are equal), then 
QsqL = aEf ~ bound becomes 

0{y/{p + 1 — k)k/£ + y/nk/P)ak+i that suppresses log¬ 
arithmic terms. To analyze qsgL and CsqL for skewed SLS, 
we consider a power-law distributed SLS, i.e., there ex¬ 
ists a small constant a and power index p > 2, such that 
s'^q,i = 1,... ,n ranked in descending order satisfy 

sfi] < i = l,... ,n 


Then it is not difficult to show that 


1 


E 


Si < 


a 


1 + 


k ^ k \ ' p — 2 

i—l ^ 

which is independent of n. Then the error bound in The¬ 
orem 1 becomes O f ‘ I 


than that in (3). 


+ j ) (Tfc+i, which is better 


Secondly, if Si oc s*, then c(s) achieves its minimum value 
and we have the two quantities written as 

qc = max^=, Ci = 1 (8) 

* vX 

In this case, when s* is flat, we have qc = and cl = 1 
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and the same bound 0 (a/( p + 1 — k)k/£+ ^nklP)ak+i 
follows. However, when s* is skewed, i.e., there exist very 
small SLS, then could be very large. As a comparison, 
the q{s) for sqL-sampling is always smaller than that for 
L-sampling due the following inequality 


2=1 



< max 


1 


= max 



= QL 


Lastly, we consider the uniform sampling Si = — . Then 
the two quantities become 


ns,- 


qu = max —-—-, cu = max —(9) 
i k i k 

Similarly, if s* is flat, qjj = and cjj = 1. Moreover, 
it is interesting to compare the two quantities for the sqL- 
sampling in (7) and for the uniform sampling in (9). 


1 




max • 


= qu 


2=1 


1 7T-S* 

CsqL = max- a/s* a/s* < mp = cu 

i=l 

From the above discussions, we can see that when s* is a 
flat vector, there is no difference between the three sam¬ 
pling scores for s. The difference comes from when s* 
tends to be skewed. In this case, Si oc works al¬ 

most for sure better than uniform distribution and could 
also be potentially better than Si oc s* according to the 
sampling dependent error bound in Theorem 1. A simi¬ 
lar tradeoff between the L-sampling and U-sampling but 
with a different taste was observed in (Maetal., 2014), 
where they showed that for least square approximation by 
CSS leveraging-based least square estimator could have a 
large variance when there exist very small SLS. Nonethe¬ 
less, our bound here exhibits more insights, especially on 
the sqL-sampling. More importantly, the sampling depen¬ 
dent bound renders the flexibility in choosing the sampling 
scores by adjusting them according to the distribution of 
the SLS. In next subsection, we present an optimization 
approach to And better sampling scores. In Figure 1, we 
give a quick view of different sampling strategies. 


4.2. Optimizing the error bound 

As indicated by the result in Theorem 1, in order to achieve 
a good performance, we need to make a balance between 
c(s) an q{s), where c(s) affects not only the error bound but 
also the successful probability. To address this issue, we 
propose a constrained optimization approach. More specif¬ 
ically, to ensure that the failure probability is no more than 


optimization based mixing 


sqL-sampling 
more unifonn 


U-sampling 

I_ 




more skewed 

■ > 


L-sampling 


-^- 

mixing suggested by Ma et al., 2014 


Figure 1. An illustration of different sampling strategies. The 
mixing strategy suggested by (Ma et al., 2014) is a convex combi¬ 
nation of U-sampling and L-sampling. Our optimization approach 
gives an intermediate sampling between the sqL-sampling and the 
L-sampling. 


3(5, we impose the following constraint on c(s) 

£ , /k\ s* £ 

8 fcc(s) ^\(5y * Si 8 fclog(|) 

( 10 ) 

Then we cast the problem into minimizing q(s) under the 
constraint in ( 10 ), i.e.. 


min max 


s.t. s'""! = fc, s* < 7 Si, i = 1 ,..., n ( 11 ) 
It is easy to verify that the optimization problem in (11) 
is convex. Next, we develop an efficient bisection search 
algorithm to solve the above problem with a linear conver¬ 
gence rate. To this end, we introduce a slack variable t and 
rewrite the optimization problem in ( 11 ) as 

min t, s.t. s^l = fe 
seR!Jl.i>0 

( 12 ) 


and 


< min 


in (^ 7 , 


,i= l,...,n 


We now And the optimal solution by performing bisection 
search on t. Let fmax and tj^in be the upper and lower 
bounds for t. We set t = (fmin + fmax)/2 and decide the 
feasibility of t by simply computing the quantity 


/«) = E- 


E (t.'a/T) 

Evidently, f is a feasible solution if f{t) < k and is not if 
f{t) > k. Hence, we will update (max = tif fit) < k and 
fmin =<if/(f) > k. To run the bisection algorithm, we 
need to decide initial fmin and (max- We can set tmin = 0. 
To compute fmax, we make an explicit construction of s by 
distributing the (1 — 7 “^) share of the largest element of s* 
to the rest of the list. More specifically, let j be the index 
for the largest entry in s*. We set Sj = ||s*||oo 7 ~^ and 
Si = s* + (1 - 7 “/||s*||oo/(n - 1) fori / j. Evidently, 
this solution satisfies the constraints s* < 73^,1 € [n] for 
7 > 1. With this construction, we can show that 
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Therefore, we set initial tmax to the value in R.H.S of the 
above inequality. Given the optimal value of t = G we 
compute the optimal value of Si by Si = — 7 =-. 

The corresponding sampling distribution clearly lies be¬ 
tween L-sampling and sqL-sampling. In particular, when 
7=1 the resulting sampling distribution is L-sampling 
due to Lemma 2 and when j —>■ 00 the resulting sampling 
distribution approaches sqL-sampling. 

Finally, we comment on the value of £. In order to make 
the constraint in (10) feasible, we need to ensure 7 > 1. 
Therefore, we need £ > fl(k log (§))■ 

4.3. Subsequent Applications 

Next, we discuss two subsequent applications of CSS, one 
for low rank approximation and one for least square ap¬ 
proximation. 

Rank-fc approximation. If a rank-fc approximation is de¬ 
sired, we need to do some postprocessing since Pc A might 
has rank larger than k. We can use the same algorithm as 
presented in (Boutsidis et al., 2011). In particular, given 
the constructed C G we first orthonormalize the 

columns of C to construct a matrix Q G with or¬ 

thonormal columns, then compute the best rank-fc approx¬ 
imation of Q^A G R^^” denoted by and finally 

construct the low-rank approximation as Q(Q~’'A)k. It was 
shown that (Lemma 2.3 in (Boutsidis et al., 201 1)) 

||A - Q{Q^A)uh < V 2 IIA - 

where 11^ f. (A) is the best approximation to A within the 
column space of C that has rank at most k. The running 
time of above procedure is 0{mni + {m + n)i'^). Regard¬ 
ing its error bound, the above inequality together with the 
following theorem implies that its spectral error bound is 
only amplified by a factor of compared to that of Pc A. 

Theorem 2. Let A G R™^" has rank p and C G R™^^ 
contain the selected columns according to sampling scores 
in s. With a probability 1 — S — 2fc exp(—£/[8/cc(s)]), we 
have 

11^ ~ A^c,k{A )\\2 < o-fc+i(l -I- e(s)) 
where e(s) is given in Theorem 1. 

Least Square Approximation. CSS has been used in 
least square approximation for developing faster and in¬ 
terpretable algorithms. In these applications, an over¬ 
constrained least square problem is considered, i.e., given 
A G and b G R™ with n, to solve the follow¬ 

ing problem: 

Xopt = arg min || Ax - h\\l (13) 

The procedure for applying CSS to least square approxima¬ 
tion is (i) to sample a set of £ > n rows from A and form 


a sampling-and-rescaling matrix denoted by G R^^™ 
(ii) to solve the following reduced least square problem: 

XoDt = arg min linAx — nbllo (14) 

xeR" 

It is worth pointing out that in this case the SLS s* = 
(s*,..., s^) are computed based on the the left singular 
vectors [/ of A by s* = ||C/i*|| 2 , where Li* is the i-th 
row of U. One might be interested to see whether we can 
apply our analysis to derive a sampling dependent error 
bound for the approximation error ||xopt — Xopt ||2 sim¬ 
ilar to previous bounds of the form ||xopt — Xopt ||2 < 
7 —^-^||Axiop — b|| 2 . Unfortunately, naively combining 
our analysis with previous analysis is a worse case analysis, 
and consequentially yields a worse bound. The reason will 
become clear in our later discussions. However, the statis¬ 
tical analysis in (Ma et al., 2014) does indicate that Xopt by 
using sqL-sampling could have smaller variance than that 
using L-sampling. 

5. Numerical Experiments 

Before delving into the detailed analysis, we present some 
experimental results. We consider synthetic data with the 
data matrix A generated from one of the three different 
classes of distributions introduced below, allowing the SLS 
vary from nearly uniform to very nonuniform. 

• Nearly uniform SLS (GA). Columns of A are 

generated from a multivariate normal distribution 
A/'(lm,S), where = 2 * This data is 

referred to as GA data. 

• Moderately nonuniform SLS (T 3 ). Columns of A are 
generated from a multivariate f-distribution with 3 de¬ 
gree of freedom and covariance matrix E as before. 
This data is referred to as T 3 data. 

• Very nonuniform SLS (Ti). Columns of A are gen¬ 
erated from a multivariate f-distribution with 1 degree 
of freedom and covariance matrix E as before. This 
data is referred to as Ti data. 

These distributions have been used in (Ma et al., 2014) to 
generate synthetic data for empirical evaluations. 

We first compare the spectral norm reconstruction error 
of the three different samplings, namely L-sampling, U- 
sampling and the sqL-sampling, and the deterministic dual 
set spectral sparsification algorithm. We generate synthetic 
data with n = m = 1000 and repeat the experiments 1000 
times. We note that the rank of the generated data matrix 
is 1000. The averaged results are shown in Figure 2. From 
these results we observe that (i) when the SLS are nearly 
uniform, the three sampling strategies perform similarly 
as expected; (ii) when the SLS become nonuniform, sqL- 

"'We abuse the same notation SI. 
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Figure 2. Comparison of the spectral error for different data, dif¬ 
ferent samplings, different target rank and different sample size. 


T1, k=10,1=500 T1, k=20,1=500 T1, k=40,1=500 



Figure 3. The spectral error for the sampling probabilities found 
by the constrained optimization approach with different values of 
7 > 1. The left most point corresponds to sqL-sampling and the 
right most point corresponds to L-sampling. 

sampling performs always better than U-sampling and bet¬ 
ter than the L-sampling when the target rank is small (e.g., 
k = 10) or the sample size £ is large; (iii) when the SLS are 
non-uniform, the spectral norm reconstruction error of sqL- 
sampling decreases faster than L-sampling w.r.t the sample 
size (iv) randomized algorithms generally perform better 
than the deterministic dual set sparsification algorithm. 

Second, we compare the sampling scores found the con¬ 
strained optimization with L-sampling and sqL-sampling. 
We vary the value of 7 from 1 (corresponding to L- 
sampling) to cxd (corresponding to sqL-sampling). A result 
with sampling size £ = 500 is shown in Figure 3. It demon¬ 
strate that intermediate samplings found by the proposed 
constrained optimization can perform better than both L- 
sampling and sqL-sampling. 

Finally, we apply CSS to over-constrained least square re¬ 
gression. To this end, we generate a synthetic data ma¬ 
trix A € with m = 50 and n = 1000 similarly 

to (Ma et ah, 2014). The output is generated by y = fi+ 
e where e (0, 9/„) and P = (lio, O.II30, lio)^- We 
compare the variance and bias of the obtained estimators 
over 1000 runs for different sampling distributions. The 
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Figure 4. Comparison of variance and squared bias of the estima¬ 
tors for different data, different samplings and different sample 
size. 
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Figure 5. Comparison of variance and squared bias of the estima¬ 
tors for different mixing strategies. Opt refers to our optimiza¬ 
tion based approach and LU refers to a convex combination of L- 
sampling and U-sampling with 7 “^ as the combination weight. 

results shown in Figure 4 demonstrate the sqL-sampling 
gives smaller variance and better bias of the estimators than 
L-sampling and U-sampling. We also compare the pro¬ 
posed optimization approach with the simple mixing strat¬ 
egy (Ma et al., 2014) that uses a convex combination of the 
L-sampling and the U-sampling. The results are shown in 
Figure 5, which again support our approach. 

More results including relative error versus varying size n 
of the target matrix, performance on a real data set and the 
Frobenius norm reconstruction error can be found in sup¬ 
plement. 


6. Analysis 

In this section, we present major analysis of Theorem 1 and 
Theorem 2 with detailed proofs included in supplement. 
The key to our analysis is the following Theorem. 


Theorem 3. Let Y, Ui, £1,2 be defined in (5). Assume that 
Ui has full row rank. We have 


||4l-PyA||^ < IIE 2 III+ 
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and 

IIS2III + 

where ^ could be 2 and F. 




2 


The first inequality was proved in (Halko et al., 2011) 
(Theorem 9.1) and the second inequality is credited 
to (Boutsidis et al., 2011) (Lemma 3.2) Previous work 
on the spectral norm analysis also start from a similar 
inequality as above. They bound the second term by 
using ||I]2fl2f^III2 < IIS2O2II2III2III2 and then bound 
the two terms separately. However, we will first write 




= 1122^220]^^ ) ^||2 using the fact Hi 


has full row rank, and then bound ||(HiH 7)“^||2 and 
II^^2^^!^II2 separately. To this end, we will apply the Ma¬ 
trix Chernoff bound as stated in Theorem 4 to bound 
||(HiH ]^)“^||2 and apply the matrix Bernstein inequality 
as stated in Theorem 5 to bound ||H2H]'^ ||2. 


6.1.Bounding II ^||2 

We will utilize Theorem 4 to bound Define 

Xi = ViV^ /Si. It is easy to verify that 

t i 

■ 1 - 1 
J=1 ^ J=1 

and F[Xi.] = ^ where we use 
12^=1 ^ V-^^Vi = Ik- Therefore we have 

Ainin(E[Xi^]) = i. Then the theorem below will follow 
Theorem 4. 

Theorem 6. With a probability 1 — k exp(—i5^f/[2fcc(s)]), 
we have 

Amin(i^li^7) ^ (1 ~ 

Therefore, with a probability 1 — k exp(—(5^^/[2fcc(s)]) we 
have ||(HiH 7)-1|12 < 


Theorem 4 (Matrix Chernoff (Tropp, 2012)). Let X be a 
finite set of PSD matrices with dimension k, and suppose 
that maxxeA:’ Ainax(^) < B. Sample {Xi ,..., Xe} inde¬ 
pendently from X. Compute 

Mmax — ^Amax(E[ALi]), pmin — -^Amin (E[2f i]) 

Then 


6.2. Bounding IIH2H7II2 


We will utilize Theorem 5 to bound ||H2H7||2- 
Then 
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Define 


PrE„,ax > (1 + 5 )m 


max ( 


<k 


(l + <5) 


1-1-5 


and E[Zj] =0. In order to use the matrix Bernstein in¬ 
equality, we will bound max^ ||.^i||2 = max^ < 


Pr 





i=l 


— (1 





^min 

B 


q(s) and Tj < ^fc)c(s) ^ rpjjgjj prove the follow¬ 

ing theorem. 

Theorem 7. With a probability \ — 5,we have 


Theorem 5 (Noncommutative Bernstein Inequality (Recht, 
2011)). Let be independent zero-mean ran¬ 

dom matrices of dimension di X ^2. Suppose tJ = 
max{||E[ZjZ7]||2j|E[^7^j||2} and \\Zj \\2 < M al¬ 
most surely for all k. Then, for any e > 0, 


Pr 


L 

E^. 

> e 

< (di-|-d 2 ) exp 

-eV2 



i=i 

2 


U:U^] + Me/ll\ 


Following immediately from Theorem 3, we have 
P - PyA\\2 < afc+1-^1 + IIH2 HT(HiHT)-i||2 


p2^^7|2 < 




2g(s) log(f) 
3 


We can complete the proof of Theorem 1 by combining 
the bounds for IIH2H7II2 and A~7(^i^7) by setting 
5 = 1/2 in Theorem 6 and using union bounds. 


7. Discussions and Open Problems 

From the analysis, it is clear that the matrix Bernstein in¬ 
equality is the key to derive the sampling dependent bound 
for IIH2H71|2- For bounding A„iin(f^if^7)’ similar analysis 
using matrix Chernoff bound has been exploited before for 
randomized matrix approximation (Gittens, 2011). 


< ak+i\Jl + IIH 2 H 7 ||iA^7(HiH7) 

< CTfc-|-l(l + p2f^i^ I|2 AP„(HiH7)), 

where the last inequality uses the fact s/of < a -|- 

b. Below we bound Aniin(^^i^^7) from below and bound 
p2f27II2 from above. 

^In fact, the first inequality is implied by the second inequality. 


Since Theorem 3 also holds for the Frobenius norm, it 
might be interested to see whether we can derive a sam¬ 
pling dependent Frobenius norm error bound that depends 
on c(s) and q{s), which, however, still remains as an 
open problem for us. Nonetheless, in experiments (in¬ 
cluded in the supplement) we observe similar phenom¬ 
ena about the performance of L-sampling, U-sampling and 
sqL-sampling. 
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Finally, we briefly comment on the analysis for least square 
approximation using CSS. Previous results (Drineas et al., 
2008; 2006b; 2011) were built on the structural conditions 
that are characterized by two inequalities 
Xrnini^UU^n) > 1/^2 

The first condition can be guaranteed by Theorem 6 with 
a high probability. For the second condition, if we adopt a 
worse case analysis 

\\u^n^nu-^u-^^hg < \\u^n^nu-^g\\u-^^hg 

and bound the first term in R.H.S of the above inequality 
using Theorem 7, we would end up with a worse bound 
than existing ones that bound the left term as a whole. 
Therefore the naive combination can’t yield a good sam¬ 
pling dependent error bound for the approximation error of 
least square regression. 

8. Conclusions 

In this paper, we have presented a sampling dependent 
spectral error bound for CSS. The error bound brings a new 
distribution with sampling probabilities proportional to the 
square root of the statistical leverage scores and exhibits 
more tradeoffs and insights than existing error bounds for 
CSS. We also develop a constrained optimization algorithm 
with an efficient bisection search to find better sampling 
probabilities for the spectral norm reconstruction. Numer¬ 
ical simulations demonstrate that the new sampling distri¬ 
butions lead to improved performance. 

References 

Boutsidis, Christos, Mahoney, Michael W., and Drineas, 
Petros. An improved approximation algorithm for the 
column subset selection problem. In Proceedings of the 
Twentieth Annual ACM-SIAM Symposium on Discrete 
Algorithms, pp. 968-977, 2009. 

Boutsidis, Christos, Drineas, Petros, and Magdon-Ismail, 
Malik. Near optimal column-based matrix reconstruc¬ 
tion. In The Annual Symposium on Foundations of Com¬ 
puter Science, pp. 305-314, 2011. 

Deshpande, Amit and Rademacher, Luis. Efficient vol¬ 
ume sampling for row/column subset selection. CoRR, 
abs/1004.4057, 2010. 

Drineas, Petros and Mahoney, Michael W. On the nystrom 
method for approximating a gram matrix for improved 
kernel-based learning. Journal of Machine Learning Re¬ 
search, 6:2005,2005. 

Drineas, Petros, Mahoney, Michael W., and Muthukrish- 
nan, S. Subspace sampling and relative-error matrix 


approximation: Column-based methods. In APPROX- 
RANDOM, volume 4110, pp. 316-326, 2006a. 

Drineas, Petros, Mahoney, Michael W., and Muthukrish- 
nan, S. Sampling algorithms for 12 regression and ap¬ 
plications. In ACM-SIAM Symposium on Discrete Algo¬ 
rithms (SODA), pp. 1127-1136, 2006b. 

Drineas, Petros, Mahoney, Michael W., and Muthukr- 
ishnan, S. Relative-error cur matrix decompositions. 
SIAM Journal Matrix Analysis Applications, 30:844- 
881,2008. 

Drineas, Petros, Lewis, Jamey, and Paschou, Peristera. 
Inferring geographic coordinates of origin for Euro¬ 
peans using small panels of ancestry informative mark¬ 
ers. PLoS ONE, 5(8):el 1892,2010. 

Drineas, Petros, Mahoney, Michael W., Muthukrishnan, 
S., and Sarlos, Tamas. Paster least squares approxima¬ 
tion. Numerische Mathematik, 117(2):219-249, Pebru- 
ary 2011. 

Drineas, Petros, Magdon-Ismail, Malik, Mahoney, 
Michael W., and Woodruff, David P. Past approximation 
of matrix coherence and statistical leverage. Journal of 
Machine Learning Research, 13:3475-3506,2012. 

Prieze, Alan, Kannan, Ravi, and Vempala, Santosh. Past 
monte-carlo algorithms for finding low-rank approxima¬ 
tions. Journal of ACM, 51(6):1025-1041,2004. 

Gittens, Alex. The spectral norm errors of the naive nys¬ 
trom extension. CoRR, abs/1110.5305, 2011. 

Gittens, Alex and Mahoney, Michael W. Revisiting the nys¬ 
trom method for improved large-scale machine learning. 
In Proceedings of International Conference of Machine 
Learning, volume 28, pp. 567-575, 2013. 

Gu, Ming and Eisenstat, Stanley C. Efficient algorithms 
for computing a strong rank-revealing qr factorization. 
SIAM Journal on Scientific Computing, 17(4):848-869, 
1996. 

Guruswami, Venkatesan and Sinop, Ali Kemal. Optimal 
column-based low-rank matrix reconstruction. In Pro¬ 
ceedings of the Twenty-third Annual ACM-SIAM Sympo¬ 
sium on Discrete Algorithms, pp. 1207-1214, 2012. 

Halko, N., Martinsson, P. G., and Tropp, J. A. Pinding 
structure with randomness: Probabilistic algorithms for 
constructing approximate matrix decompositions. SIAM 
Review, 53:217-288,2011. 

laved. A., Drineas, P, Mahoney, M. W., and Paschou, P. Ef¬ 
ficient genomewide selection of PCA-correlated tSNPs 
for genotype imputation. Annals of Human Genetics, 75 
(6):707-722, Nov 2011. 



Sampling Dependent Spectral Error Bound for CSS 


Ma, Ping, Mahoney, Michael W., and Yu, Bin. A statistical 
perspective on algorithmic leveraging. In Proceedings of 
the 31th International Conference on Machine Learning 
('/CMLj,pp. 91-99, 2014. 

Mahoney, Michael W. Randomized algorithms for matrices 
and data. Foundations and Trends in Machine Learning, 
3(2); 123-224,2011. 

Pan, C.-T. On the existence and computation of rank- 
revealing lu factorizations. Linear Algebra and its Appli¬ 
cations, 316(13);199 - 222, 2000. Special Issue: Con¬ 
ference celebrating the 60th birthday of Robert J. Plem- 
mons. 

Pan, Ching-Tsuan and Tang, PingTakPeter. Bounds on sin¬ 
gular values revealed by qr factorizations. BIT Numer¬ 
ical Mathematics, 39(4);740-756, 1999. ISSN 0006- 
3835. 

Paschou, Peristera, Mahoney, Michael W., laved. A., Kidd, 
J. R., Pakstis, A. J., Gu, S., Kidd, K. K., and Drineas, 
Petros. Intra- and interpopulation genotype reconstruc¬ 
tion from tagging SNPs. Genome Research, 17(1):96- 
107, Jan 2007a. 

Paschou, Peristera, Ziv, Elad, Burchard, Esteban G., 
Choudhry, Shweta, Rodriguez-Cintron, William, Ma¬ 
honey, Michael W., and Drineas, Petros. PCA-correlated 
SNPs for structure identification in worldwide human 
populations. PLoS Genetics, 3(9);el60H-, 2007b. 

Recht, Benjamin. A simpler approach to matrix comple¬ 
tion. Journal Machine Learning Research (JMLR), pp. 
3413-3430,2011. 

Tropp, Joel A. User-friendly tail bounds for sums of ran¬ 
dom matrices. Found. Comput. Math., 12(4):389^34, 
August 2012. ISSN 1615-3375. 

Wang, Shusen and Zhang, Zhihua. A scalable cur matrix 
decomposition algorithm: Lower time complexity and 
tighter bound. In Advances in Neural Information Pro¬ 
cessing Systems 25, pp. 656-664. 2012. 

Wang, Shusen and Zhang, Zhihua. Improving cur matrix 
decomposition and the nystrom approximation via adap¬ 
tive sampling. Journal of Machine Learning Research, 
14(1):2729-2769,2013. 



